Disassembly Techniques for Reverse Engineering


Cedric Van Goethem Djairho Geuens 

Ghent University Ghent University 

cedric.vangoethem@ugent.be djairho.geuens@ugent.be

Bart Middag 
Ghent University 
bart.middag@ugent.be

Abstract 


Plenty of programs used today still come in the form of a binary executable. These binaries can be 
reverse engineered to retrieve structural and semantic information about their execution. By recreating 
assembly code from binary executables, disassemblers provide the first step toward this goal. To protect 
the intellectual property of binary programs, obfuscation techniques were invented, thwarting these 
disassemblers. In this paper, we provide a survey of the most common disassembly techniques and their 
reaction to the obfuscations. We show that most obfuscations can be undone and that future obfuscation 
techniques will have to find ways to thwart these intelligent disassemblers. 


1. Introduction 

In the process of reconstructing a program's
source code from its binary executable, 
there are two consecutive steps to be taken. 
The first step is the regeneration of an assem- 
bly language program out of binary code. This 
step is called disassembly. When an assembly 
language program is available, the high-level 
source code can be reconstructed; this step is 
called decompilation. This survey paper focuses 
on the disassembly of binary executables. 

Disassembly is especially useful when the 
original source code of a program is unavail- 
able and modifications to the program have to 
be made. One can then translate the binary 
code into an assembly language program and 
modify it. Of course, disassembly can also be 
used for less legal activities, such as hacking. 

To protect code, software developers started 
using obfuscation techniques IfTTl IlHl l2ll l23l to 
impede proper understanding. While previous
research in the domain of obfuscation had
focused on the decompilation phase, multiple
obfuscation techniques for binary code were
developed by targeting weaknesses of
disassemblers. Due to these developments,
research in the domain of disassembly began
to focus on obfuscated code in the early 2000s.

Other papers on this topic often mix up re- 
search around successful disassembly of obfus- 
cated binaries and the deobfuscation of these 
binaries. A clear distinction has to be made as 
these are two very different topics. With this 
overview of widely used obfuscation and dis- 
assembly techniques, we hope this paper will 
help provide a fresh insight into the current 
state of disassembly and the problems disas- 
semblers must face. 

2. Basic disassembly approaches 

There are two main approaches to disassembling
a binary executable: static and dynamic.

The static approach analyzes the binary 
code without actually running it and produces 
a complete program containing all possible exe- 
cution paths. It has the advantage of being able 
to analyze the entire file at once, but also has 
the disadvantage of being easier to mislead. 

The dynamic approach analyzes the binary
code by running it with a given input. This
approach avoids the problem of the static
approach, since only instructions that are
actually executed are observed, but it also
means that execution paths other than the one
taken for the given input are not analyzed.

We can distinguish two main static
disassembly techniques [25]: linear sweep and
recursive traversal.

Linear sweep starts at the beginning of the
binary file and decodes instructions
sequentially. The algorithm does not take the
semantics of the disassembled instructions into
account. Hence, data embedded in the code will
also be decoded as instructions.

Recursive traversal builds up the control flow
graph of the program by following the control
transfers it encounters. Each time a control
transfer is identified, its target addresses
are determined and the disassembler continues
decoding instructions at the potential targets
of the control transfer.
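The difference between the two techniques can be illustrated with a minimal sketch. The instruction set below is hypothetical and exists only for this example (it is not x86): opcodes are one byte, some instructions carry a one-byte operand, and jump targets are absolute addresses. A single data byte placed behind an unconditional jump is enough to make linear sweep decode a bogus instruction and swallow the real one that follows.

```python
# Hypothetical toy instruction set: opcode -> (mnemonic, length in bytes, kind).
OPCODES = {
    0x01: ("nop", 1, "plain"),
    0x02: ("push", 2, "plain"),
    0x03: ("jmp", 2, "jump"),     # unconditional, absolute target
    0x04: ("jz", 2, "branch"),    # conditional: target and fall-through
    0xFF: ("ret", 1, "stop"),
}


def decode(code, addr):
    """Decode one instruction at addr; return (text, length, kind, operand)."""
    name, length, kind = OPCODES.get(code[addr], ("(bad)", 1, "plain"))
    if addr + length > len(code):
        return "(truncated)", len(code) - addr, "stop", None
    operand = code[addr + 1] if length == 2 else None
    text = name if operand is None else f"{name} {operand:#x}"
    return text, length, kind, operand


def linear_sweep(code):
    """Decode byte after byte from the start, ignoring control flow."""
    out, addr = [], 0
    while addr < len(code):
        text, length, _, _ = decode(code, addr)
        out.append((addr, text))
        addr += length
    return out


def recursive_traversal(code, entry=0):
    """Decode only addresses reachable by following control transfers."""
    seen, work = {}, [entry]
    while work:
        addr = work.pop()
        while addr < len(code) and addr not in seen:
            text, length, kind, operand = decode(code, addr)
            seen[addr] = text
            if kind in ("jump", "branch"):
                work.append(operand)      # continue later at the target
            if kind in ("jump", "stop"):
                break                     # no fall-through path
            addr += length
    return sorted(seen.items())


# jmp 3 | data byte 0x02 | nop | ret
PROGRAM = bytes([0x03, 0x03, 0x02, 0x01, 0xFF])
print(linear_sweep(PROGRAM))         # decodes the data byte as "push 0x1"
print(recursive_traversal(PROGRAM))  # skips the data byte, finds the nop
```

In this sketch the data byte at address 2 looks like the opcode of a two-byte instruction, so linear sweep consumes the real instruction at address 3 as its operand. Recursive traversal never visits address 2 because no control transfer leads there.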


Figure 1: General disassembly techniques. Disassembly is either static or dynamic; static disassembly uses either linear sweep or recursive traversal.


3. Key problems in static disassembly

Standard disassemblers make the following
assumptions [18] [14]:

3.1. Linear sweep 

A0: All instructions are placed consecutively
as bytes in the binary.

The key problem is that data blocks and
code blocks can be intermixed. This causes the
linear sweep technique to interpret data blocks
as code and wrongly recognize and extract
instructions from them. This problem can be
solved using recursive traversal.

3.2. Recursive traversal

A1: Conditional transfer instructions always
have two possible targets: a fall-through
address and a target address.

A2: Computed (indirect) control transfer
instructions can be analyzed (e.g. jump
tables).

A3: Function calls always return to the
instruction right after the call instruction.

Some of these assumptions are not always
justified. A disassembler that relies on
assumption A1 can be misled by creating a
conditional transfer that always follows one
path and then placing misleading instructions
or junk bytes in the other path.

Also, assumption A2 is not always
straightforward in the case of indirect,
pointer-based jumps. In Figure 2 we illustrate
indirect control transfers with a switch
statement using a jump table (data section)
inside the code. The selection of the case that
has to be executed is performed using an
indirect pointer-based jump to an address in
this jump table.

Furthermore, assumption A3 is not
necessarily true and might cause a static
disassembler to further examine data and
falsely recognize it as code.

Another problem lies in the possibility of a
program modifying its own instructions at
runtime. This is called self-modifying code.
When code modifies itself, static analysis is
of limited use, as there is no guarantee that
the analyzed instructions will remain the same
at runtime or for different inputs.

4. Obfuscation


We can obfuscate the assembly code by
modifying it so that it no longer satisfies the
assumptions mentioned above. These
modifications do not alter the flow of the
program, but make it significantly harder to
analyze.

   9:                   mov  eax, ebp                  #read index variable n
   b:                   sub  eax, 0x1                  #index minus one
   e:                   cmp  eax, 0x1                  #check index
  11:                   ja   1a <default>              #jump if index > 2
  13:                   jmp  DWORD PTR [eax*4+0x3c]    #jump to address depending on n
0000001a <default>:                                    #default case
  1a:                   mov  eax, ebp
  1c:                   sub  eax, 0x1
0000003c <table>:                                      #data block with jump table
  3c: 44 00 00 00       dd   44 <C1>
  40: 4b 00 00 00       dd   4b <C2>
00000044 <C1>:                                         #first case
  44: bb 01 00 00 00    mov  ebx, 0x1
  49: eb f3             jmp  36 <return>
0000004b <C2>:                                         #second case
  4b: bb 01 00 00 00    mov  ebx, 0x1
  50: eb ec             jmp  36 <return>

Figure 2: x86 Assembly code for a Fibonacci program
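As an aside on assumption A2, the jump table of Figure 2 shows what a disassembler must do to resolve an indirect jump. The sketch below assumes that the table base (0x3c), the entry size (4 bytes, little-endian) and the index bound (0 or 1 after the `cmp`/`ja` check) have already been recovered; the function and variable names are chosen for this example only.

```python
import struct

TABLE_BASE = 0x3C                                    # address of <table> in Figure 2
TABLE_BYTES = bytes.fromhex("44000000" "4b000000")   # raw bytes at 0x3c and 0x40


def jump_target(eax: int) -> int:
    """Resolve `jmp DWORD PTR [eax*4+0x3c]` for a given index value."""
    # The memory operand is eax*4 + TABLE_BASE; relative to the start of
    # the table this is simply eax*4.
    (target,) = struct.unpack_from("<I", TABLE_BYTES, eax * 4)
    return target


# After `sub eax, 0x1; cmp eax, 0x1; ja <default>` only 0 and 1 reach the jump.
targets = [jump_target(i) for i in (0, 1)]
print([hex(t) for t in targets])
```

Without the bound on the index, a static disassembler cannot tell where the table ends, which is why obfuscators can attack this assumption so easily.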

4.1. Opaque predicates 

To invalidate the assumption that each condi- 
tional transfer has two targets, we can modify 
an unconditional control transfer by introduc- 
ing a condition that always evaluates to either 
true or false: an opaque predicate. This condi- 
tion can be quite complex based on properties 
in linear algebra, calculus, and more. While 
its value is known at compile time, it will still 
have to be evaluated at runtime SI- If the recur- 
sive traversal algorithm fails to recognize the 
opaque predicate, it will analyze both control 
paths. 
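A minimal sketch of an opaque predicate is given below. It relies on the fact that the product of two consecutive integers is always even; the function names and the decoy computation are invented for this example and do not come from any particular obfuscator.

```python
def opaquely_true(x: int) -> bool:
    """Always True: x * (x + 1) is a product of consecutive integers, hence even."""
    return (x * (x + 1)) % 2 == 0


def original(n: int) -> int:
    return n + 1


def obfuscated(n: int, x: int) -> int:
    # The unconditional flow of `original` is hidden behind a condition
    # that looks data-dependent but always takes the same branch.
    if opaquely_true(x):
        return n + 1            # real path
    return n * 31337 - 7        # decoy path, never executed


print(all(obfuscated(n, x) == original(n) for n in range(50) for x in range(-50, 50)))
```

A disassembler that does not recognize the predicate has to treat the decoy path as reachable code.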

Additionally, junk bytes can be placed in
the path that will not be executed. On CISC
instruction sets, the disassembly of these junk
bytes can cause a misalignment of instructions
which can propagate further into the valid
instruction stream [18], causing instructions to
be disassembled wrongly.¹

This obfuscation can be undone by analyzing
whether the value of such a predicate depends
on the program's input or only on the internal
program structure [5]. In the latter case, we
can assume that program input has no influence
on the outcome of the predicate, and the unused
control path can be ignored. This technique is
called use-dependence tracking. By successfully
recovering the unconditional jumps, recursive
traversal can reconstruct a correct control flow
graph using the original methods mentioned
above.
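The core question of such an analysis, whether a value can be traced back to program input, can be sketched as a reachability check over definitions. This is a heavy simplification and not the algorithm of [5]: the dictionary-based representation, the variable names and the function name are assumptions made for this example.

```python
def depends_on_input(defs, var, inputs):
    """Return True if `var` transitively depends on any name in `inputs`.

    `defs` maps each variable to the variables it is computed from;
    names without an entry are treated as constants.
    """
    seen, work = set(), [var]
    while work:
        v = work.pop()
        if v in inputs:
            return True
        if v in seen:
            continue
        seen.add(v)
        work.extend(defs.get(v, ()))
    return False


# Hypothetical definitions: p is built from a constant only, q also uses input n.
DEFS = {
    "t1": ("c7",),
    "t2": ("t1", "t1"),
    "p": ("t2",),
    "q": ("n", "t1"),
}
print(depends_on_input(DEFS, "p", {"n"}))  # candidate opaque predicate
print(depends_on_input(DEFS, "q", {"n"}))  # genuinely input-dependent
```

Note that a predicate can read program input and still be opaque, so a check of this kind only identifies the predicates that are independent of input.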

Opaque predicates can also be detected by
performing dynamic analysis. This will not
reconstruct the whole assembly, but can inform
a static disassembler. During runtime, the
dynamic disassembler might discover that some
control paths are never taken. This information
can be forwarded to the static disassembler,
which can then mark such predicates as
potentially opaque. This is called hybrid
disassembly [19].

1 It can be shown that the instructions will resynchronize, following a distribution similar to the Kruskal Count [10].


4.2. Branch functions 

Branch functions are capable of completely hid- 
ing the control flow by invalidating the assump- 
tion that indirect control transfers can be ana- 
lyzed. 

This is done by replacing all control
transfers by a call to a branch function, as
illustrated in Figure 3. This function computes
the target address based on an input value or
the caller address. While the possible target
addresses might be visible to some recursive
traversal disassemblers, most disassemblers
will not be able to identify the correct target
address without executing the computation.
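The idea can be sketched as follows. The addresses and the displacement table are invented for this example; a real branch function would typically hide the table and rewrite the return address on the stack instead of returning a value.

```python
# Hypothetical table: return address of each call site -> displacement to the
# real target of the control transfer that the call replaced.
DISPLACEMENTS = {0x1005: 0x3B, 0x1020: -0x10, 0x1042: 0x1E}


def branch_function(return_address: int) -> int:
    """Compute where execution really continues after the call."""
    return return_address + DISPLACEMENTS[return_address]


def assumed_successor(return_address: int) -> int:
    """What a disassembler relying on assumption A3 expects: the call returns."""
    return return_address


for site in DISPLACEMENTS:
    print(hex(site), "->", hex(branch_function(site)))
```

None of the call sites continues at the address that assumption A3 predicts, so the bytes following each call can be filled with junk.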



Figure 3: Disassembled control flow in a branch function 


Additionally, it is possible to obfuscate the
code of the branch function in such a way that
a static disassembler will not be able to
discover the set of potential targets of the
branch function (b_1, ..., b_n). This is
illustrated in Figure 3.

By performing dynamic analysis, one can
always correctly identify the target of a
specific call to the branch function. However,
as dynamic analysis takes time and may only
execute part of the code, it is difficult to
determine all possible targets. If one is only
interested in modifying a part of the code,
this method is nevertheless sufficient.²

4.3. Call stack tampering 

As many disassemblers assume standard use of 
the call and ret instructions, obfuscation tech- 
niques often use these instructions to thwart 
disassemblers. Apart from non-returning calls, 
one can use call stack tampering for nonstan- 
dard control transfers with the ret instruction: 
pushing an address to the stack and returning 
is equivalent to an unconditional jump. 
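The equivalence can be checked on a toy machine. The interpreter below is a sketch with a hypothetical instruction format of (operation, argument) tuples, where addresses are simply indices into the program list.

```python
def run(program, max_steps=100):
    """Execute (op, arg) tuples and return the list of visited indices."""
    stack, pc, trace = [], 0, []
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, arg = program[pc]
        trace.append(pc)
        if op == "halt":
            break
        elif op == "jmp":
            pc = arg
        elif op == "push":
            stack.append(arg)
            pc += 1
        elif op == "ret":
            pc = stack.pop()      # continue at whatever is on top of the stack
        else:                     # "nop"
            pc += 1
    return trace


WITH_JMP = [("jmp", 3), ("nop", None), ("nop", None), ("halt", None)]
WITH_RET = [("push", 4), ("ret", None), ("nop", None), ("nop", None), ("halt", None)]

print(run(WITH_JMP))  # [0, 3]
print(run(WITH_RET))  # [0, 1, 4]
```

Both programs skip the two `nop` instructions and end at `halt`, but only the first one contains an instruction that a disassembler would recognize as a jump.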

Currently, the only two ways to statically
determine the true target of a suspicious ret
instruction are ret target prediction [23] and
use-dependence tracking for stack cells [15] [5].
Both methods work by analyzing past stack
manipulations. If these methods fail, one has
to resort to dynamic analysis to determine the
target of the ret instruction.

4.4. Exception-based control transfers 

Obscuring control flow can also be accom- 
plished by using exception-based control transfers: 
it is very hard to statically predict which in- 
structions might raise exceptions. Obfuscators 
can generate such instructions, which trigger 
a custom exception handler that, much like a 
branch function, computes the target address 

OS- 

Static analysis will often fail to recognize
these exception-based control transfers and, as
a result, the exception handlers and target
blocks might not be disassembled. However, it
is simple to detect them with dynamic analysis:
the debugger interface will identify all
registered exception handlers when a fault
occurs and inform the disassembler of any
exception-raising instructions [23].

2 Anti-debugging techniques exist to greatly increase the cost of dynamic analysis and demotivate hackers. However,
disassemblers like OllyDbg [28] and IDA Pro [24] can hide themselves from simple anti-debugging techniques.

4.5. Embedded virtual machines 

Many packers today use embedded virtual ma- 
chines to hide sensitive code from hackers. An 
executable packed this way includes bytecode 
that cannot be executed without the code that 
implements the virtual machine. 

While some disassemblers might correctly 
disassemble the instructions that implement 
the virtual machine, both static and dynamic 
disassemblers will falsely recognize the sen- 
sitive code as data. Many approaches to 
this problem work by first reverse-engineering 
the code that implements the virtual machine 
and then disassembling the bytecode. These 
outside-in approaches often make many as- 
sumptions about the virtual machine that are 
easily sidestepped. 

However, Coogan and Debray [5] have
proposed a novel method that disassembles the
sensitive code inside-out by reconstructing the
code based on the system calls the virtual
machine makes.

4.6. Self-modifying code 

As previously explained, static disassembly
cannot handle self-modifying code, because it
can only analyze the code before any
modifications.³ This can be used to obtain
further resistance against static
disassemblers. For example, many commercial
software titles use encrypted code that is
decrypted (hence generated) at runtime [23].
Decryption algorithms can range from a simple
XOR operation to a sophisticated usage of
cryptographic hash functions [1].
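The simplest case, a single-byte XOR, can be sketched as follows. The key is hypothetical; the plaintext bytes are the encoding of `mov ebx, 0x1` as listed in Figure 2.

```python
KEY = 0x5A  # hypothetical single-byte key


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with the key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


plain = bytes([0xBB, 0x01, 0x00, 0x00, 0x00])  # mov ebx, 0x1
stored = xor_bytes(plain, KEY)     # what a static disassembler finds in the file
runtime = xor_bytes(stored, KEY)   # what the decryption stub produces in memory

print(stored.hex(), runtime.hex())
```

A static disassembler only ever sees `stored`, which decodes to unrelated instructions or to nothing valid at all.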

Using dynamic disassembly, it is possible
to partially analyze the decrypted code. To
analyze the full code, a hacker would have to
manually undo the encryption and analyze the
decrypted code. This holds true for the many
other usages of self-modifying code.


5. Conclusion 

Both disassembly and obfuscation are very ac- 
tive research domains. In the past decade, 
many other obfuscation techniques have been 
developed, like instruction overlapping IfTSl flOl , 
mmm. These developments have inspired 
numerous reactions in the domain of disassem- 
bly 11141 [271 l2ll Fl3l f23l . Currently, research on 
disassembly has gained the upper hand with 
recent research by S. Debray 0, but in the 
future, more obfuscation techniques will be 
developed to counter this. 

The challenge now lies in developing disas- 
sembly techniques that are resistant to all ob- 
fuscation techniques, even ones that have not 
been thought of yet. Such techniques would 
bring this cat-and-mouse game between the 
domains of obfuscation and disassembly to an 
end. However, it does not seem as if this is 
going to happen in the near future. It is likely 
that both domains will continue to be active 
and many more developments will be made 
on both sides in the years to come. 
3 While self-modifying code alone does not prevent disassembly completely, it impedes understanding, effectively
7. Review report


This review report is about the following paper: 

K. Coogan, G. Lu, and S. Debray. Deobfuscation of virtualization-obfuscated software: A semantics-
based approach. In Proceedings of the 18th ACM Conference on Computer and Communications Security,
CCS '11, pages 275-284, New York, NY, USA, 2011. ACM.

7.1. Summary 

This paper proposes a novel technique for the deobfuscation of virtualization-obfuscated code.
Virtual code is code based on an instruction set entirely generated by the obfuscator. The obfuscator
translates the original assembly code into its virtual equivalent. At runtime this code is interpreted
by a virtual machine embedded in the executable. The goal of this paper is to distinguish the
virtual machine from the actual instructions and to reconstruct the original assembly code.

The key problem with virtual code analysis is that there is no prior knowledge about the
instructions and control transfers, due to the randomization of the instruction set. Therefore, not
many assumptions can be made. In this paper, the assumption is made that system calls can be
correctly identified since they have to meet the ABI calling conventions. Using dynamic analysis,
the algorithm tries to compute the origin of the system call arguments, which can be inferred from
the ABI specification, and the instructions that indirectly influence their value. This is a novel
technique called use-dependence tracking. From these dependence chains, a distinction can be
made between instructions that directly influence system calls and those that do not. The former
have a high chance of being the original assembly code and the latter are most likely the virtual
machine's interpreter code.

By making only one assumption about the system calls, this technique can be seen as a new
general approach towards the deobfuscation of virtualized code. The results show that the
algorithm can correctly distinguish the virtual machine from the original code in most cases.
The authors state that their technique defeats most of the current obfuscation techniques, but
that it can be easily thwarted by introducing false dependences for the system calls.


7.2. Review

This review format was based on the ACM TACO Journal review guidelines. 

General opinion: This paper tackles a complex problem in the field of deobfuscation of
virtualized code. It introduces a novel technique and might prove seminal for further
research. The article is well written and the concepts are clearly explained in each section. The
results are promising, but the choice of test cases is limited. Further tests on larger binaries
would be an interesting addition to the paper or a subsequent publication.

Does this paper present innovative ideas or material? Yes, such as: 

• Use-dependence analysis

• A unified approach for virtual code analysis 

• Filtering virtualized code from the original assembly 
Is the information in the paper sound, factual, and accurate? Yes.


What are the major contributions of the paper? It proposes a virtual-machine-independent
technique for the deobfuscation of virtual-machine-obfuscated binaries.

Does this paper cite and use appropriate references? Yes. 

Is the treatment of the subject complete? Yes, all relevant control flow obfuscations are
addressed. Therefore, this technique can be applied to real obfuscated binaries.

Recommendations to the authors: 

• We suggest writing automated test cases so the analysis does not have to be done by hand. It 
would be interesting to see analysis on larger real world applications instead of hand-crafted 
toy programs. 

• The reason why some false positives occur during the analysis isn't quite clear. Is this
because of intrinsic properties of this technique, or does it need further optimisations? This
could use some further explanation as guidance for future research.

• More information on further work that could help defeat the obfuscation proposed in
section 4 would be welcome.

• Publication of the source code and how to obtain it would be helpful for further research. 

Recommendation to the publisher: Accept. 